Abstract

Background: Reconciled gene trees yield orthology and paralogy relationships between genes. This information 
may however contradict other information on orthology and paralogy provided by other footprints of evolution, 
such as conserved synteny. 

Results: We explore a way to include external information on orthology in the process of gene tree construction. 
Given an initial gene tree and a set of orthology constraints on pairs of genes or on clades, we give polynomial- 
time algorithms for producing a modified gene tree satisfying the set of constraints, that is as close as possible to 
the original one according to the Robinson-Foulds distance. We assess the validity of the modifications we 
propose by computing the likelihood ratio between initial and modified trees according to sequence alignments 
on Ensembl trees, showing that often the two trees are statistically equivalent. 

Availability: Software and data available upon request to the corresponding author. 



Introduction 

A gene tree represents the evolutionary relationships 
between a set of homologous genes. Gene trees are use- 
ful to unveil the molecular evolutionary events that have 
shaped today's genomes. They are traditionally con- 
structed from sequence alignments [1], while recent 
methods also use the information from species phyloge- 
nies through reconciliation [2-8]. But constructing good 
gene trees is still challenging: for example, while they 
yield orthology and paralogy relationships between 
genes, often alternative or additional information, such 
as conserved synteny, is used to provide or confirm 
orthology [9]. 

The orthology information suggested by gene tree 
reconciliation may be contradictory with that suggested 
by an external source, such as conserved synteny [10,11]. 
We explore a way to reconcile them by performing slight 
modifications to a given gene tree in order to fit external 
information on orthology. 
We propose two kinds of gene tree modification,
which consist in computing a gene tree as close as pos- 
sible to the initial one, satisfying two kinds of con- 
straints. One kind is a set of pairs of genes that should 
be orthologous but are seen as paralogous in the initial 
tree. This occurs when orthologs are computed with 
synteny for example [11]. The other kind is a set of 
clades that should be rooted by speciation nodes but are 
rooted by duplication nodes in the initial tree. This 
occurs when dubious duplications are detected because 
of the absence of extant support for a duplication, or 
because of ancestral synteny information [10]. We give 
polynomial-time algorithms for both problems under 
the Robinson-Foulds distance, thus proposing several 
ways to improve gene trees according to external 
information. 

There are very few gene tree reconstruction methods 
including synteny information [12], whereas integrating 
this information could be valuable [13]. The modifica- 
tions we propose could be included in a local search fra- 
mework as other kinds of modifications based on 
duplications and losses [14-17]. We assess the validity of 
the modifications we propose by computing the likeli- 
hood ratio between initial and modified trees according 
to sequence alignments on Ensembl trees [18], showing
that often the two trees are statistically equivalent. 

Different gene tree corrections 

Phylogenies 

A phylogeny is a rooted binary tree which represents the
evolutionary relationships between the nodes. Internal
nodes are extinct ancestors, leaves are extant elements
and edges represent direct descent between parents
and children. Given a node x of a phylogeny T, we call
an ancestor of x any node on the path from the root
(inclusive) of T to the parent of x. For a leaf-subset X
of T, lca_T(X), the lowest common ancestor of X, denotes
the farthest node from the root which is an ancestor of
all elements of X. We use the notation l(x), and call the
clade of x, the set of leaves which are descendants of
an internal node x. We also denote by l(T) the set of
leaves, and by V(T) the set of nodes of T.

We define two kinds of phylogenies: species trees and 
gene trees. Species are identified with genomes. For our 
purpose, genomes are simply sets of genes. Therefore, 
each gene g, extant or ancestral, belongs to a species s 
(g). We then have one species tree S, where nodes are 
identified with species, and many gene trees, where 
nodes are identified with genes. The set of genes in a 
gene tree is called a gene family. 

A reconciliation between a gene tree G and a species
tree S consists in assigning to each gene g of G (both
extant and ancestral) the species s(g) corresponding to
the lowest common ancestor in S of the set {s(l), for all
l ∈ l(g)}. Every internal node g of G is labeled by an
event E(g), verifying E(g) = speciation if s(g) is different
from s(g_l) and s(g_r), where g_l and g_r are the two children
of g, and E(g) = duplication otherwise.
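The labeling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: trees are nested tuples with gene or species names at the leaves, a species node is identified by the set of extant species below it, and the names `leaves`, `lca_clade`, `reconcile` and the toy data are hypothetical, not taken from the paper's software.

```python
def leaves(tree):
    """Set of leaf labels of a nested-tuple binary tree."""
    if isinstance(tree, str):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def lca_clade(species_tree, targets):
    """Lowest node of species_tree whose clade contains all targets,
    returned as the frozenset of extant species below it."""
    node = species_tree
    while not isinstance(node, str):
        child = next((c for c in node if targets <= leaves(c)), None)
        if child is None:
            break
        node = child
    return frozenset(leaves(node))

def reconcile(gene_tree, species_tree, species_of):
    """Return {clade: event} for every internal node of gene_tree."""
    events = {}
    def visit(g):
        s = lca_clade(species_tree, {species_of[x] for x in leaves(g)})
        if not isinstance(g, str):
            s_left, s_right = visit(g[0]), visit(g[1])
            # speciation iff s(g) differs from s of both children
            is_spec = s != s_left and s != s_right
            events[frozenset(leaves(g))] = "speciation" if is_spec else "duplication"
        return s
    visit(gene_tree)
    return events

# Toy data (assumed for illustration only)
S = (("A", "B"), "C")
species_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c": "C"}
G = ((("a1", "b1"), ("a2", "b2")), "c")
events = reconcile(G, S, species_of)
```

In this toy tree the node grouping {a1, b1, a2, b2} is a duplication because both of its children map to the same species node as it does, while the root and the two cherries are speciations.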

The reconciliation of G and S gives all the information
about the gene family history. In particular, it defines the
gene content of an ancestral species at the time of speciation.
A reconciliation also implies the orthology and
paralogy relationships between genes: two genes g and g'
of T are said to be orthologous if E(lca_T(g, g')) = speciation;
g and g' are paralogous if E(lca_T(g, g')) = duplication.
For example, Figure 1(1) shows a gene tree
reconciled with a species tree. In this gene tree a x and 
b 1 are paralogous as their lowest common ancestor is d 
which is a duplication node, while al and bl are ortho- 
logous. The number of dots inside big circles represents 
the number of genes in the corresponding genome 
(each big circle represents a species). 

The Robinson-Foulds (RF) distance

The RF distance RF(G, G') between two phylogenies G
and G' is the cardinality of the symmetric difference
between the clade-sets of the two trees. In other words,
denote by c(G, G') the number of clades that are in G
but not in G'. Then RF(G, G') = c(G, G') + c(G', G).

In this paper, since we only compare rooted binary
trees sharing the same leaf-sets, they always have the
same number of internal nodes, and hence the same
number of clades. Therefore c(G, G') = c(G', G), and
RF(G, G') = 2c(G, G').
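As a small illustration of this definition, the following Python sketch computes the RF distance between two rooted trees on the same leaf-set. It assumes trees are written as nested tuples with leaf names as strings; the function names `clades` and `rf_distance` are chosen here for illustration.

```python
def clades(tree):
    """Set of clades (frozensets of leaf names) of the internal nodes."""
    found = set()
    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = visit(node[0]) | visit(node[1])
        found.add(clade)
        return clade
    visit(tree)
    return found

def rf_distance(g1, g2):
    """c(G, G') + c(G', G): size of the symmetric difference of clade-sets."""
    c1, c2 = clades(g1), clades(g2)
    return len(c1 - c2) + len(c2 - c1)

# Two toy trees differing by a single clade ({a, b} versus {a, c})
g1 = ((("a", "b"), "c"), "d")
g2 = ((("a", "c"), "b"), "d")
distance = rf_distance(g1, g2)
```

Since both trees are binary and share their leaves, each has exactly one clade missing from the other, so the distance is twice that count.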

Two correction problems 

Suppose that in addition to a species tree and a set of 
reconciled gene trees, we are given additional informa- 
tion of two kinds: 

• Pairs of genes that we know are orthologous; 

• Duplication nodes of some gene trees that we sus- 
pect to be false. 

Constraints of orthology on pairs of genes may for 
example be generated from synteny analysis [9,11]. 
Some pairs may contradict the information given by the 
gene tree. Let P be a set of pairs (g_1, g_2) of orthologous
extant genes (verifying s(g_1) ≠ s(g_2)). A gene tree G is
said to satisfy a set P if, for any pair (g_1, g_2) ∈ P,
lca_G(g_1, g_2) is a speciation node.
Problem 1 Gene Orthology Correction [GOC] Problem 
Input: A gene tree G reconciled with a species tree S, 
and a set P of gene pairs that are required to be 
orthologous; 

Output: A corrected gene tree G_P satisfying P, such that
RF(G, G_P) is minimum among all possible solutions.

An example is given in Figure 1: (1) is the initial tree,
and (2) depicts two syntenic regions of size 3 surrounding
genes b1 and a1. In general (if we neglect the effect
of gene conversion) genes in two syntenic regions should
be either all pairwise orthologous or all pairwise paralogous
[11]. Consequently, if the two neighbors of b1 on
genome B and of a1 on genome A are inferred to be
orthologous (according to their lowest common ancestor
in their respective gene trees), then an orthology constraint
should be imposed on the pair (b1, a1).
This principle is usually considered as one of the most
efficient methods to detect orthologies [9]. (3) is a corrected
tree.

On the other hand, duplication nodes of a gene tree 
can be considered dubious for different reasons. For 
example, in Ensembl [19], "dubious" is a label assigned 
to the non-apparent duplication nodes [20,21] pointing 
to an incongruence between the gene tree and the spe- 
cies tree. Alternatively, inferred ancestral synteny may 
also point to dubious duplication nodes [10]. Formally, 
clades corresponding to some duplication nodes may 
erroneously be considered as sets of paralogous genes, 
and should rather be considered as orthologous. 
Figure 1 Description of the two problems. (1) A gene tree (the "initial tree") for the gene family {c, b1, b2, a1, a2} is shown with small red
nodes and single thin red edges. It is reconciled with the phylogeny of the three species A, B and C shown with large green nodes and hollow
edges represented by a pair of parallel black lines. Duplication nodes of the reconciled gene tree are squared, while speciation nodes and leaves
are dots. (2) The two neighbors of b1 on genome B and of a1 on genome A are inferred to be orthologous according to their lowest common
ancestor in their respective gene trees (not shown). This is an argument for inferring orthology between b1 and a1, which is in contradiction
with the information provided by the initial tree: their lowest common ancestor is a duplication, and thus they are inferred to be paralogous.
(3) A solution to the GOC problem, that is a gene tree of minimum RF distance with the initial tree verifying the constraint of b1 and a1 being
orthologous. (4) A solution to the COC problem, that is a reconciled tree in which the clade {b1, b2, a1, a2} of d in the initial tree is rather
rooted by a speciation node in the corrected tree. This is an example where the optimal solutions to the two problems differ.



A gene tree G is said to satisfy a set C of its clades if
E(lca_G(c)) = speciation for all c ∈ C.

Problem 2 Clade Orthology Correction [COC] 
Problem 

Input: A gene tree G reconciled with a species tree S, 
and a set C of clades of G assigned to duplication nodes; 

Output: A corrected tree G_C satisfying C, such that
RF(G, G_C) is minimum among all possible solutions.

The two problems are different, as exemplified by
Figure 1, where (3) is an optimal solution to GOC while
(4) is an optimal solution to COC, the latter being more
distant from the initial tree.

In the next two sections, we use S for the species tree 
name, G for the reconciled gene tree, and we give effi- 
cient solutions to these two problems. 

The Gene Orthology Correction Problem 

Notice that for any instance of the GOC problem, a cor- 
rected tree satisfying P always exists. Indeed, for any 
extant species x of S, one can make a tree whose leaf-set
is all the extant genes g of G for which s(g) = x.
Doing this for every species yields a forest whose roots 
can be reconnected by matching the topology of S, 
ensuring that any pair of genes not in the same species 



are orthologous. However, the obtained tree can be very 
far from the original. 

Let P be a set of gene pairs (which are leaves of G)
required to be orthologous. Notice that if (a, b) ∈ P,
then we also have (b, a) ∈ P. For any pair (a, b) ∈ P, if
lca_G(a, b) is a duplication in G, then (a, b) is a pair of
false paralogs. The set P_f ⊆ P denotes the set of all false
paralogous pairs of P.

Given two distinct leaves a and b of G, we set r_{a,b} =
lca_G(a, b), s_{a,b} = lca_S(s(a), s(b)), and define h_{a,b} as the
highest node (closest to the root) on the path from a to
r_{a,b} such that s(h_{a,b}) is a descendant of s_{a,b}. Notice that
h_{a,b} can be a itself, but not r_{a,b}.

For instance on Figure 2(1), a1 and c2 are false paralogs
with r_{a1,c2} = e3 and s_{a1,c2} = E. From this, one can deduce
that h_{a1,c2} = d2 and h_{c2,a1} = c2. We show below that, for
any pair (a, b) of false paralogs, h_{a,b} is the highest node
on the path from a to r_{a,b} over which we can move b to
make lca_G(a, b) a speciation node. The reason for moving
b as high as possible is to preserve as many clades
as possible, allowing a minimum RF distance between
the initial and corrected tree.

Lemma 1 Let (a, b) be a pair of false paralogs in G,
and let G' be a tree in which a and b are orthologous. If



Lafond et al. BMC Bioinformatics 2013, 14(Suppl 15):S5
http://www.biomedcentral.com/1471-2105/14/S15/S5




Figure 2 GOC Procedure. (1) A gene tree G reconciled with species tree S. Duplication nodes are denoted by a black square. The leaves and
internal nodes of G are labeled with the letter of their corresponding species. Brackets denote the required orthologs given by the input set P =
{(a1, b1), (a1, c1), (a1, c2)}. The non-preservable nodes (nodes of H) are depicted by red crosses, while preservable nodes are circled in green. (2)
The species tree associated with G. (3) The tree G_P, a solution to the GOC problem, which preserves every possible clade.



x is an ancestor of h_{a,b} and a descendant of r_{a,b}, then the
clade of x is not in G'.

Proof: Suppose otherwise that there is some x' ∈ V(G')
with the same clade as x (and hence s(x) = s(x')). Let
r'_{a,b} = lca_{G'}(a, b), which should be a speciation. Since b
was not in the clade of x, it cannot be in the clade of x'
either, implying that r'_{a,b} is an ancestor of x'. Also, since
s(x') = s(x) and x is above h_{a,b} in G, we have that s(x) is
s_{a,b} or one of its ancestors (otherwise we would have
picked x to be h_{a,b}). But r'_{a,b} has x' in one of its subtrees,
and b in the other, implying that r'_{a,b} is a duplication:
contradiction. □

We now have a way to identify a set of clades that
cannot be in G_P. For any (a, b) ∈ P_f, denote by H_{a,b} the
set of ancestors of h_{a,b} that are descendants of r_{a,b}. If G_P
satisfies the set P_f, G_P cannot contain any clade from
the set H = ∪_{(a,b) ∈ P_f} H_{a,b}. It follows that a minimum of
|H| clades of G are missing in G_P. We claim that a solution
G_P to the GOC problem is obtained by modifying
exactly c(G, G_P) = |H| clades.
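To make the set of non-preservable nodes concrete, here is a hedged Python sketch for a single false paralogous pair. It assumes nested-tuple trees, identifies a species node with the set of extant species below it (so "proper descendant" becomes "proper subset"), and reads the descendants of r_{a,b} as strict descendants. The helper names and the toy trees are illustrative assumptions, not the authors' implementation.

```python
def leaves(t):
    return {t} if isinstance(t, str) else leaves(t[0]) | leaves(t[1])

def lca_clade(species_tree, targets):
    """Species-tree LCA of targets, as a frozenset of extant species."""
    node = species_tree
    while not isinstance(node, str):
        child = next((c for c in node if targets <= leaves(c)), None)
        if child is None:
            break
        node = child
    return frozenset(leaves(node))

def path_from_root(tree, leaf):
    """Subtrees from the root down to leaf (leaf assumed present)."""
    path = [tree]
    while not isinstance(path[-1], str):
        left, right = path[-1]
        path.append(left if leaf in leaves(left) else right)
    return path

def non_preservable(gene_tree, species_tree, species_of, a, b):
    """Clades of the nodes strictly above h_{a,b} and strictly below r_{a,b}."""
    def s(node):
        return lca_clade(species_tree, {species_of[g] for g in leaves(node)})
    path = path_from_root(gene_tree, a)
    # r_{a,b} is the lowest node on the path to a that also contains b
    r_index = max(i for i, node in enumerate(path) if b in leaves(node))
    s_ab = lca_clade(species_tree, {species_of[a], species_of[b]})
    # keep the nodes whose species is not a proper descendant of s_{a,b}
    return {frozenset(leaves(x)) for x in path[r_index + 1:] if not s(x) < s_ab}

# Toy instance (assumed): a1 and b1 are joined at a duplication root
S = (("A", "B"), "C")
species_of = {"a1": "A", "b1": "B", "b2": "B", "c1": "C"}
G = ((("a1", "b2"), "c1"), "b1")
H_ab = non_preservable(G, S, species_of, "a1", "b1")
H_ba = non_preservable(G, S, species_of, "b1", "a1")
```

Here h_{a1,b1} is the leaf a1 itself, so both internal nodes between a1 and the root are non-preservable, whereas nothing lies between b1 and the root.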

Theorem 1 Let G_P be a solution to the GOC problem.
Then RF(G, G_P) = 2|H|.

In what follows, we give a constructive proof of Theo- 
rem 1 by describing an algorithm for solving the GOC 
problem. 

An algorithm for the GOC problem 

Call V(G) \ H the set of preservable nodes of G (those
that we hope to preserve). For example in Figure 2(1),

H = H_{a1,c2} ∪ H_{c2,a1} ∪ H_{a1,c1} ∪ H_{c1,a1} ∪ H_{a1,b1} ∪ H_{b1,a1} = {e1} ∪ {e2} ∪ ∅ ∪ ∅ ∪ {d1} ∪ ∅ = {e1, e2, d1}.

The nodes of H are represented by red crosses, while
the preservable nodes are circled in green. Notice that
the root r of G is preservable, since any solution G_P to
the GOC problem should share the same leaf-set as G.



Consider the set Q of subtrees of G rooted on the high- 
est preservable descendants of r, i.e. preservable nodes 
for which r is the unique preservable ancestor. Observe 
that since any leaf of G is preservable, we have
∪_{G_x ∈ Q} l(G_x) = l(G). If, for some (g_1, g_2) ∈ P, g_1 and g_2 are
scattered across two subtrees of Q, we call these subtrees
required orthologous subtrees (or simply required
orthologs when the context is clear as to whether we are 
comparing genes or subtrees). For example in the tree G
of Figure 2(1), Q is the set of subtrees rooted at d2, c1,
b3 and c2 (the last three restricted to a single leaf), and
the subtrees rooted at d2 and c1 are required orthologs,
as well as those rooted at d2 and c2. However, connecting
two subtrees under a speciation might not always be
feasible. A definition of possible orthologs follows.

Definition 1 (Possible orthologs) Two subtrees G_1,
G_2 ∈ Q rooted at x_1, x_2 respectively are possible orthologs
if and only if s(x_1) and s(x_2) are unrelated, i.e. neither is
an ancestor of the other in S.

The following lemma ensures that the roots of 
required orthologous subtrees can actually be joined 
under a common parent which is a speciation. 

Lemma 2 Let G_1, G_2 ∈ Q be required orthologs. Then
G_1 and G_2 are possible orthologs.

Proof. Let x_1, x_2 be the roots of G_1, G_2 respectively,
and let (g_1, g_2) ∈ P such that g_1 ∈ l(G_1) and g_2 ∈ l(G_2).
Let s_l, s_r be the left and right children of s_{g_1,g_2}, and
denote by S_l and S_r the subtrees of S rooted at s_l and
s_r respectively. Suppose without loss of generality that
s(g_1) is in l(S_l) and s(g_2) is in l(S_r). Since x_1 is preservable
and on the path between g_1 and r_{g_1,g_2}, we have
x_1 ∉ H_{g_1,g_2} and thus s(x_1) ∈ V(S_l). Similarly, s(x_2) ∈
V(S_r). Therefore s(x_1) and s(x_2) are unrelated, and G_1 and G_2
are possible orthologs.
The problem, formally defined in the sequel as the
maximum orthology tree, consists in joining all trees of
Q into a single tree G' in a way ensuring that each pair
of possible orthologs is joined under a speciation. More
precisely, for some possible orthologs G_1, G_2 ∈ Q rooted
at nodes x_1, x_2, we get that lca_{G'}(x_1, x_2) is a speciation,
with G_1, G_2 being unchanged.

We begin by giving an overview of the whole 
algorithm. 

Algorithm Outline: 

1. Compute the set H = ∪_{(a,b) ∈ P_f} H_{a,b} of internal nodes
of G corresponding to clades that cannot be in G_P;

2. Compute the set Q of subtrees rooted at the highest
preservable descendants of the root of G. If Q is empty,
return G and terminate;

3. Construct a tree G' by joining all trees of Q in a way
ensuring that possible orthologs are joined under speciation.
We call G' the maximum orthology tree for Q;

4. For every tree G_x ∈ Q, construct G_{x,P} by recursively
repeating Steps 2 to 4 with G being G_x, and replace the
G_x subtree of G' by G_{x,P}.

The tree obtained corresponds to the corrected tree
G_P we want. Running this algorithm on the G tree of
Figure 2 yields the corrected tree G_P. This algorithm
terminates, since we eventually reach all the leaves of G,
which correspond to terminal cases in the recursion.
Implementing step 1 is straightforward, while step 2 can
be done by performing a depth-first search from the
root, in which upon visiting a preservable node, we add
it to Q and continue the search without visiting its children.
Step 3 is the purpose of the next section, so
assume for now that it can be performed correctly as
stated. This algorithm can be implemented to run in
O(|P| × |V(G)|) steps in the worst case, the main bottleneck
being the computation of H. The algorithm correctness
follows from the two lemmas below.
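The depth-first search of step 2 can be sketched as follows. This is a minimal illustration, assuming nested-tuple trees and a set H given as clades (frozensets of leaf names); the function name `highest_preservable` and the toy input are hypothetical.

```python
def leaves(t):
    return {t} if isinstance(t, str) else leaves(t[0]) | leaves(t[1])

def highest_preservable(tree, H):
    """Subtrees rooted at the highest preservable proper descendants of the root.

    H is the set of non-preservable clades; leaves are never in H."""
    found = []
    def visit(node):
        if frozenset(leaves(node)) not in H:
            found.append(node)      # preservable: record it, do not go deeper
            return
        for child in node:          # non-preservable internal node: keep searching
            visit(child)
    if not isinstance(tree, str):
        for child in tree:
            visit(child)
    return found

# Toy input (assumed): both internal nodes below the root are non-preservable
G = ((("a1", "b2"), "c1"), "b1")
H = {frozenset({"a1", "b2", "c1"}), frozenset({"a1", "b2"})}
Q = highest_preservable(G, H)
```

With these inputs every subtree in Q is a single leaf, which corresponds to the terminal case of the recursion.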

Lemma 3 Any preservable node x of G is preserved in
G_P, meaning that the clade of G rooted at x is a clade of
G_P.

Proof: Let x be a preservable node of G and G_x be the
subtree rooted at x. It is not hard to see that eventually,
steps 2-4 will be run on G_x and return a tree G_{x,P},
which will itself be a subtree of the final corrected tree
G_P. As the algorithm only moves and reconnects subtrees
of G_x, we have that l(G_x) = l(G_{x,P}). Since G_{x,P} is a
subtree of G_P, it follows that the clade of x is preserved
in G_P.

Lemma 4 Let (g_1, g_2) ∈ P. Then g_1 and g_2 are orthologs
in G_P.

Proof. Denote by G_v the subtree rooted at v, for some
v ∈ V(G). Let x be a preservable node and G_{x,P} be the
subtree produced after running steps 2-4 on G_x. Let D
be the set of highest preservable descendants of x. We say
that a gene pair (g_1, g_2) is contained in G_x if g_1, g_2 ∈ l(G_x).

We use induction on the height of the tree to show that
all gene pairs in P that are contained in G_x are orthologous
in G_{x,P} (which proves the lemma since x can be the root).
This is trivially true for leaves as they are preservable and
contain no gene pairs. We thus suppose by induction that
for any d ∈ D, gene pairs in P that are contained in G_d are
orthologous in G_{d,P}. Let (g_1, g_2) ∈ P such that (g_1, g_2) is
contained in G_x, but there is no d ∈ D such that G_d contains
(g_1, g_2). What is left to prove is that g_1 and g_2 are
orthologous in G_{x,P}.

We first observe that g_1, g_2 belong to two different
subtrees G_{d_1}, G_{d_2}, where d_1, d_2 ∈ D. Otherwise
G_{d_1} = G_{d_2}, implying that (g_1, g_2) is contained in G_{d_1} and
we are done. Therefore, G_{d_1}, G_{d_2} are required orthologs,
and hence possible orthologs. Since we may assume that
G_{d_1} and G_{d_2} are joined under a speciation in G_{x,P}, we
get that lca_{G_{x,P}}(g_1, g_2) is a speciation. The result follows
from observing that G_{x,P} is a subtree of G_P.

Maximum orthology tree 

We now describe a solution to the maximum orthology tree
problem. Formally, given a set of k possible orthologous
subtrees of G rooted on a set of nodes X = {x_1, ..., x_k}, the
problem is to construct a tree F with l(F) = X, such that for
each pair x_i, x_j ∈ X that correspond to roots of possible
orthologs, x_i and x_j are orthologous in F.

Roughly speaking, the algorithm proceeds as follows:
start with F_0 being a copy of S. Iterate over i from 1 to
k, at each step constructing F_i by grafting x_i on F_{i-1} right
above the node v ∈ V(F_0) such that s(v) = s(x_i).
Proceeding this way, we show in Lemma 5 that nodes of
V(F_0) are ensured to remain speciation nodes throughout
the procedure, and in Lemma 6 that the lowest common
ancestor of two possible orthologs belongs to V(F_0),
leading to Corollary 1 stating that possible orthologs
are in fact orthologous in the output tree. Finally
remove the leaves artificially introduced by F_0 and standardize
the tree, which means

♦ remove all nodes with no descendant labeled with 
extant genes; 

♦ contract non-root degree 2 nodes, then contract 
the root if it is of degree one. 

Starting with F_0 being a copy of S is a step that might
be omitted, but the set of nodes V(F_0) serves as a skeleton
around which we graft our x_i's, making it both easily
implementable and provable. Figure 3 shows how the
algorithm proceeds on the set of highest preservable
descendants of the root of the tree G in Figure 2(1).
Algorithm 1 findMaxOrthology(S, X = {x_1, ..., x_k})
F_0 ← a copy of S
V_0 ← V(F_0)
L ← l(F_0)
Figure 3 The Max Orthology problem. An instance of the max orthology problem, with X being the highest preservable descendants of the
root of G in Figure 2. (1) The starting tree F_0, which is a copy of S. (2) The F_k tree, which depicts the tree obtained after grafting every node of
X. (3) The final tree F, obtained by removing the leaves initially in F_0 and standardizing.



for i = 1 → k do

Find the unique node v ∈ V_0 such that s(v) = s(x_i)
F_i ← a copy of F_{i-1} on which we graft x_i on the
edge linking v to its parent node (or if v is the root of
F_{i-1}, create a new root with children v and x_i)
end for

F ← F_k on which we remove L and standardize
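The following Python sketch builds the same kind of tree as Algorithm 1, but in a reformulated way that is an assumption of this example rather than the paper's procedure: instead of copying S, grafting, and then standardizing, it recurses over S and only ever creates nodes that carry at least one x_i, which should give the standardized result directly. Species nodes are identified by the set of extant species below them, and the name `max_orthology_tree` is hypothetical.

```python
def leaves(t):
    return {t} if isinstance(t, str) else leaves(t[0]) | leaves(t[1])

def max_orthology_tree(species_tree, roots):
    """roots: list of (x_i, s(x_i)) with s(x_i) a frozenset of extant species.

    Returns a nested tuple over the x_i, or None if roots is empty."""
    def build(v):
        base = None
        if not isinstance(v, str):
            parts = [p for p in (build(v[0]), build(v[1])) if p is not None]
            if len(parts) == 1:
                base = parts[0]                  # degree-2 node contracted
            elif len(parts) == 2:
                base = (parts[0], parts[1])      # node of the skeleton: speciation
        here = frozenset(leaves(v))
        for x, s in roots:
            if s == here:
                # graft x right above v (duplication unless base is empty)
                base = x if base is None else (base, x)
        return base
    return build(species_tree)

# Toy instance (assumed): x4 maps to the ancestor of A and B
S = (("A", "B"), "C")
X = [("x1", frozenset({"A"})), ("x2", frozenset({"B"})),
     ("x3", frozenset({"C"})), ("x4", frozenset({"A", "B"}))]
F = max_orthology_tree(S, X)
```

In the resulting tree, x1 and x2 meet at a node coming from the species skeleton, as do x3 and each of the others, while x4 is attached just above the node of its own species and is therefore not orthologous to x1 or x2, with which it is not a possible ortholog.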

Lemma 5 If r ∈ V(F_0) ∩ V(F), then r is a speciation.

Proof: Since F_0 is a copy of S, all nodes of V(F_0) are
initially speciation nodes. We show that each grafting
operation does not change the event corresponding to
these nodes. Say that at iteration i, we graft x_i on the
edge linking v to its parent node p. We first observe
that the only nodes that can be transformed from speciation
in F_{i-1} to duplication in F_i are on the path from p
to the root of F_{i-1}. Suppose without loss of generality
that v is the left child of p in F_{i-1}, and let w be the
newly created node between p and v in F_i. Thus w has
children x_i and v, and since s(x_i) = s(v), we get that
s(w) = s(v). It follows that if p was a speciation in F_{i-1}, it
remains a speciation in F_i. Moreover, this implies that
s(p) is left unchanged in F_i, implying in turn that any
ancestor of p cannot change from speciation to duplication.
Therefore, no grafting operation can affect speciation
of any vertex in V(F_{i-1}). Finally, we note that
removing leaves or deleting degree two nodes in F also
cannot affect speciation nodes.

Lemma 6 Let x_i, x_j ∈ X be the roots of possible orthologous
subtrees. Then, lca_F(x_i, x_j) ∈ V(F_0).

Proof: First recall that if x_i, x_j are the roots of possible
ortholog subtrees, then there is some s ∈ V(S) such
that s(x_i) and s(x_j) are in the left and right subtrees of s,
respectively. Now, let r be the unique node in V(F_0)
such that s(r) = s, and let v_i, v_j ∈ V(F_0) such that s(v_i) =
s(x_i) and s(v_j) = s(x_j). It is clear that in F_0, lca(v_i, v_j) = r.
This also holds for any F_t by observing that grafting
nodes cannot change the lca relationship. Since x_i is
grafted on some edge between v_i and r, and x_j between
v_j and r, it follows that lca_F(x_i, x_j) = r ∈ V(F_0).



Corollary 1 Let x_i, x_j ∈ X be the roots of possible
orthologs. Then they are orthologous in F.

The Clade Orthology Correction Problem 

We prove several results characterizing the solutions to
the COC problem. Let C be a set of clades that has to
be satisfied. For a clade c ∈ C, we denote by s(c) the
value of s(r(c)) where r(c) is the root of c, and by E(c)
the value of E(r(c)) that we call the label of c.

First, unlike in the GOC problem, a solution to the 
COC problem does not always exist. Indeed, it is possi- 
ble that no gene tree has all clades in C labeled by spe- 
ciations. We give a necessary and sufficient condition 
for the existence of a solution. The following lemma is 
obvious from the definition of reconciliation, and will be 
used in several proofs. 

Lemma 7 For a reconciled gene tree G, if a node x is 
an ancestor of a node y and s(x) = s(y) then E(x) = 
duplication. 

Theorem 2 There is a solution to the COC problem if
and only if for every clade c ∈ C, s(c) is not a leaf of S,
and if for every pair c_1, c_2 ∈ C, either c_1 and c_2 are disjoint
sets of leaves, or s(c_1) ≠ s(c_2).

The necessity of these conditions directly follows from
Lemma 7, since s(c_1), s(c_2) and the ancestry relationship
between c_1 and c_2 remain unchanged in a solution.
Their sufficiency will be constructively demonstrated in 
the sequel. Suppose that the conditions are satisfied. We 
give a way of finding all optimal solutions according to 
the RF distance, followed by two ways of finding an 
optimal one optimizing other criteria in addition. 
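The existence condition of Theorem 2 is simple to check mechanically. The sketch below is an illustration under assumed conventions: each constraint is a pair (clade, species) where the clade is a frozenset of gene names and the species s(c) is represented by the frozenset of extant species below it, so a leaf of S is a set of size one. The name `coc_has_solution` is hypothetical.

```python
from itertools import combinations

def coc_has_solution(constraints):
    """Check the two conditions of the existence theorem on a list of
    (clade, species) pairs."""
    # s(c) must not be a leaf of S
    if any(len(species) == 1 for _, species in constraints):
        return False
    # any two clades are disjoint or mapped to different species
    return all(c1.isdisjoint(c2) or s1 != s2
               for (c1, s1), (c2, s2) in combinations(constraints, 2))

AB = frozenset({"A", "B"})
single = [(frozenset({"a1", "b1", "a2", "b2"}), AB)]
nested_same_species = single + [(frozenset({"a1", "b1"}), AB)]
leaf_species = [(frozenset({"a1", "a2"}), frozenset({"A"}))]
```

The second instance fails because a clade and one of its sub-clades are both mapped to the same species, so they cannot both be rooted by a speciation.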

Given a duplication node x of G, pushing x by multi- 
furcation means applying the following procedure: 

• Let s = s(x), and A and B be the two children of s
in S.

• Let T_A be the set of maximal subtrees of the subtree
of G rooted at x, such that all their leaves l verify
that s(l) is a descendant of A (including A itself).
Let G_A[x] be the multifurcated tree obtained by joining
all roots of trees in T_A under a common root.

• Let symmetrically T_B be the set of maximal subtrees
of the subtree of G rooted at x, such that all
their leaves l verify that s(l) is a descendant of B
(including B itself). Let G_B[x] be the multifurcated
tree obtained by joining all roots of trees in T_B
under a common root.

• Let G' be obtained from G by replacing the clade
rooted at x by a new subtree, obtained by joining G_A[x]
and G_B[x] under a common root.

This rearrangement is described in [16] and applied to 
dubious duplications as a preprocessing step for ances- 
tral genome reconstruction. 
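A possible sketch of pushing by multifurcation on the subtree rooted at x is given below. It assumes nested tuples of arbitrary arity for multifurcated trees, a dictionary `species_of` from genes to extant species, and the two sides A and B given as sets of extant species; all names and the toy data are assumptions made for illustration.

```python
def leaves(t):
    if isinstance(t, str):
        return {t}
    return set().union(*(leaves(c) for c in t))

def maximal_subtrees(tree, allowed, species_of):
    """Maximal subtrees of tree whose leaves all belong to allowed species."""
    if all(species_of[g] in allowed for g in leaves(tree)):
        return [tree]
    if isinstance(tree, str):
        return []
    return [m for child in tree
            for m in maximal_subtrees(child, allowed, species_of)]

def push_by_multifurcation(subtree, side_a, side_b, species_of):
    """Join the A-side and B-side maximal subtrees under a new root.

    Assumes both sides are non-empty, which holds when s(x) is the
    parent of A and B."""
    def join(parts):
        return parts[0] if len(parts) == 1 else tuple(parts)
    part_a = maximal_subtrees(subtree, side_a, species_of)
    part_b = maximal_subtrees(subtree, side_b, species_of)
    return (join(part_a), join(part_b))

# Toy duplication clade (assumed)
species_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
x = ((("a1", "b1"), "a2"), ("a3", "b2"))
pushed = push_by_multifurcation(x, {"A"}, {"B"}, species_of)
```

The A side becomes a multifurcation over a1, a2 and a3; any binary resolution of it keeps the new root as a speciation.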

A binary resolution G b of a multifurcated tree G is a 
binary tree in which all the clades of G are in G b . 

Theorem 3 If there is a solution to the COC problem, 
then a binary gene tree is an optimal solution if and 
only if it is a binary resolution of the multifurcated tree 
obtained by pushing the roots of the elements of C by 
multifurcation (in any order). 

Proof: It is clear that a binary resolution is a solution, 
provided that the conditions for the existence of a solu- 
tion are satisfied. Indeed any clade is preserved through 
pushing a duplication node, so this operation can be 
done for all clades in C independently. This proves the 
converse part of Theorem 2. 

Then it is an optimal solution because by Lemma 7,
no clade x which is a descendant of the pushed clade c
such that s(c) = s(x) may be conserved if we want c to
be a speciation node. And by construction all clades
such that s(c) ≠ s(x) are preserved by this operation.

Binary resolutions which minimize the number of 
duplications and losses are studied by [22] and may be 
applied to provide bona fide phylogenies. We describe 
an alternative maximizing the number of common triplets.
A triplet in a tree G is a set of three leaves ((a, b),
c) of G, such that the LCA of the three is strictly more
ancient than the LCA of the first two.

Given a species tree S, a reconciled gene tree G and 
one of its duplication nodes x, pushing x by tree dupli- 
cation means applying the following procedure, illu- 
strated in Figure 4: 

• Let s = s(x), and A and B be the two children of s
in S.

• Let G_A[x] be a tree obtained from the subtree of G
rooted at x, by deleting all leaves l with s(l) being a
descendant of A, and standardizing it, which as in
the previous sections, means

- removing all nodes with no descendant labeled
with extant genes;

- contracting non-root degree 2 nodes, then contracting
the root if it is of degree one.

• Let symmetrically G_B[x] be a tree obtained from
the subtree of G rooted at x, by deleting all leaves l
with s(l) being a descendant of B, and standardizing
it.

• Let G' be obtained from G by replacing the clade
rooted at x by a new subtree, obtained by joining G_A[x]
and G_B[x] under a common root.
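The procedure above can be sketched in Python as a restriction of the subtree to each side followed by standardization. This is a hedged illustration: trees are nested tuples, `species_of` maps genes to extant species, the two sides are given as sets of extant species, and the two parts are named here after the leaves they keep rather than after the leaves they delete. The function names are hypothetical.

```python
def restrict(tree, keep):
    """Delete the leaves rejected by keep, then standardize:
    drop empty nodes and contract nodes left with a single child."""
    if isinstance(tree, str):
        return tree if keep(tree) else None
    parts = [p for p in (restrict(c, keep) for c in tree) if p is not None]
    if not parts:
        return None
    return parts[0] if len(parts) == 1 else tuple(parts)

def push_by_tree_duplication(subtree, side_a, side_b, species_of):
    """Join the copy restricted to side A and the copy restricted to side B."""
    kept_a = restrict(subtree, lambda g: species_of[g] in side_a)
    kept_b = restrict(subtree, lambda g: species_of[g] in side_b)
    return (kept_a, kept_b)

# Toy duplication clade (assumed)
species_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
x = ((("a1", "b1"), "a2"), ("a3", "b2"))
pushed = push_by_tree_duplication(x, {"A"}, {"B"}, species_of)
```

Unlike a multifurcation, each side keeps the branching order it had in the original subtree, so the triplet ((a1, a2), a3) of the toy tree is still present after the push.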

Note that if a clade y is disjoint from x or assigned to 
a different species, then pushing x by tree duplication 
does not affect the subtree rooted at y. In consequence, 
pushing several clades by tree duplication in any order
gives a unique solution if the clades satisfy the conditions
of Theorem 2.

Theorem 4 If there is a solution to the Clade Orthol- 
ogy Correction problem, the gene tree obtained by succes- 
sively pushing the roots of the elements of C by tree 
duplication (in any order) is an optimal solution. Among 
all optimal solutions, it maximizes the number of com- 
mon triplets with G. 

Proof. As already noticed pushing a duplication by 
multifurcation preserves all clades assigned to species 
which are different from the species assigned to the 
pushed node. So it is an optimal solution. 

Now we have to prove that none of the triplets that
are in G but not in G' can be preserved in any other
optimal solution. For this we characterize the triplets
that can be preserved. For a triplet ((a, b), c) of G, let
T_((a,b),c) be the rooted phylogeny with three leaves and
two internal nodes containing the triplet. If the leaves a,
b, c are in the pushed clade x, then the triplet can be
preserved only if in the reconciliation of T_((a,b),c), the
lowest internal node is not mapped to s(x). Otherwise
by Lemma 7, the root node of the triplet cannot be a
speciation.

Let ((a, b), c) be a triplet such that in the reconciliation
of T_((a,b),c), the lowest internal node is not mapped
to s(x). This triplet is entirely included in G_A[x] or G_B[x].
So it is preserved. In consequence all triplets possibly
preserved are indeed preserved by the operation, showing
the optimality of the procedure regarding the number
of common triplets.

Now if there is no solution to the Clade Orthology
Correction problem, we advise pushing duplication nodes in C
starting from the highest ones, without having formalized
why we find this solution adequate.

Fish gene trees 

Using synteny as evidence of orthology, we wanted to test 
the ability of our algorithm designed for the GOC problem 
to correct gene trees. To this end, we considered 
the four fish genomes Gasterosteus aculeatus (Stickleback), 
Oryzias latipes (Medaka), Tetraodon nigroviridis, 
and Danio rerio (Zebrafish) with human and mouse as 
outgroups. We used the Ensembl Genome Browser to col- 
lect all available gene trees, and filtered each tree to pre- 
serve only genes from the taxa of interest. We then 
reconciled the trees with the known species trees, and 
identified duplication and speciation nodes. Following 
our methodology in [11], a region surrounding a gene is 
defined as the substring containing the gene and both its 
left and right adjacencies, and two regions are considered 
syntenic if they contain homologous genes in the same 
order. We observed in [11] that more than 22% of the 
6241 collected gene trees contain at least one false 
paralogy, that is, a pair of genes that synteny requires to 
be orthologous but for which the LCA of the corresponding 
leaves is a duplication rather than a speciation node. 
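
The detection of false paralogies can be sketched as follows. This is a minimal illustration assuming the usual LCA reconciliation rule, in which a gene tree node is a duplication when it is mapped to the same species tree node as one of its children. The nested-tuple trees, the toy species A, B, C, and all function names are hypothetical and are not the implementation used for the experiments.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def clades(tree):
    """Return (leaf set of tree, list of all clades of tree as frozensets)."""
    if isinstance(tree, str):
        return frozenset([tree]), [frozenset([tree])]
    here, all_clades = frozenset(), []
    for child in tree:
        child_set, child_clades = clades(child)
        here |= child_set
        all_clades += child_clades
    all_clades.append(here)
    return here, all_clades


def species_lca(genes, gene_species, species_clades):
    """LCA mapping: smallest species clade containing the species of all genes."""
    wanted = {gene_species[g] for g in genes}
    return min((c for c in species_clades if wanted <= c), key=len)


def gene_lca(tree, a, b):
    """Smallest subtree of the gene tree containing the distinct leaves a and b."""
    for child in tree:
        if not isinstance(child, str) and {a, b} <= leaves(child):
            return gene_lca(child, a, b)
    return tree


def is_duplication(node, gene_species, species_clades):
    """A node is a duplication if it maps to the same species as one child."""
    here = species_lca(leaves(node), gene_species, species_clades)
    return any(
        species_lca(leaves(child), gene_species, species_clades) == here
        for child in node
    )


def false_paralogies(gene_tree, required, gene_species, species_tree):
    """Required orthologous pairs whose LCA in the gene tree is a duplication."""
    species_clades = clades(species_tree)[1]
    return [
        (a, b)
        for a, b in required
        if is_duplication(gene_lca(gene_tree, a, b), gene_species, species_clades)
    ]


# Toy data: three species and one gene per species.
S = (("A", "B"), "C")
G = (("a1", "c1"), "b1")
gene_species = {"a1": "A", "b1": "B", "c1": "C"}
required = [("a1", "b1"), ("a1", "c1")]
print(false_paralogies(G, required, gene_species, S))
```

Here the root of G is mapped to the root of S, as is its child (a1, c1), so the root is a duplication and the required pair (a1, b1) is reported as a false paralogy, whereas (a1, c1) is not.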

For 1000 of the trees containing at least one false 
paralogy, we applied the correction procedure previously 
described, and retrieved the gene family alignment from 
Ensembl. With PhyML [23], we computed the likelihood 
of the initial and corrected trees, given the alignment. 
These two likelihood values were compared with Consel 
[24]. For only 17.7% of the trees, the correction was 
rejected by the AU test. In other words, the correction 
algorithm is valid for a vast majority (82.3%) of the 
tested trees. Moreover, the likelihood of the corrected 
tree is higher than the original for 44.4% of the trees. 
Interestingly, 14.8% of the original Ensembl gene trees 
were rejected when compared to the corrected trees. 

The correction of the gene tree for the ZNF800 gene 
family, which is related to transcriptional regulation, is 
given as an example in Figure 5. The corrected tree was 
highly favored by the AU test, giving it a statistical 
support advantage with a p-value below 0.001. Furthermore, 
the non-apparent duplication of G, located at the root of 
the (m1, t1, s1) subtree, was eliminated, resulting in one 
less duplication in G P . 

Figure 5 An example of corrected fish tree. The tree for the ZNF800 gene family before and after correction, restricted to the species Stickleback (S), Medaka (M), Tetraodon (T) and Zebrafish (Z). (1) The original gene tree G given by Ensembl, using the same notation as in Figure 2 for duplications, preservable nodes and required orthologs. Gene region analysis gave us the required orthologs P = {(m1, s1), (t1, s1)}. (2) The species tree associated with the four species. (3) The gene tree given by our correction algorithm. 

Conclusion 

We give two efficient algorithms for two new gene tree 
rearrangement problems, related to the correction of a 
gene tree according to some external information on 
orthology. The rearrangements are modifications that 
are as small as possible, given some distance criterion 
(namely the RF distance), but can be more significant 
according to other distances such as the usual NNI 
(nearest neighbor interchange) distance. We show that 
for fish genomes, the rearrangements we define can be 
an efficient way to explore statistically equivalent gene 
trees when sequence alignment is used to compute 
likelihood. As corrected trees satisfy synteny constraints, 
we can be reasonably confident that they describe the 
gene family evolution better. 
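
For readers unfamiliar with the distance criterion, the following Python sketch computes a Robinson-Foulds distance between two rooted trees on the same leaf set. It assumes the common clade-based formulation, in which the distance is the number of clades present in exactly one of the two trees; the nested-tuple tree encoding and the function names are choices made for this illustration only.

```python
def collect_clades(tree, out):
    """Add every internal clade below `tree` to `out`; return the leaf set of `tree`."""
    if isinstance(tree, str):
        return frozenset([tree])
    here = frozenset().union(*(collect_clades(child, out) for child in tree))
    out.add(here)
    return here


def rf_distance(tree1, tree2):
    """Number of clades found in exactly one of the two rooted trees."""
    clades1, clades2 = set(), set()
    collect_clades(tree1, clades1)
    collect_clades(tree2, clades2)
    return len(clades1 ^ clades2)


# Toy example (hypothetical trees on four leaves).
G = ((("a", "b"), "c"), "d")
G_prime = (("a", "b"), ("c", "d"))
print(rf_distance(G, G_prime))
```

In this example the clade {a, b, c} appears only in the first tree and the clade {c, d} only in the second, so the distance is 2.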

Many algorithmic and theoretical problems remain 
open. For example, is there a similar way for handling 
paralogy constraints? What about having both orthology 
and paralogy constraints? It can be shown that there 
exist sets of constraints with both types that cannot be 
satisfied. What are the conditions for a set of orthology/ 
paralogy constraints to be satisfiable? 

These algorithms may be used in a global framework 
to construct large gene tree sets which are arguably better 
than those found in standard databases. The implementation 
of such a framework is ongoing work. 